Measuring and reducing the carbon footprint of fMRI preprocessing in fMRIPrep

Abstract Computationally expensive data processing in neuroimaging research places demands on energy consumption—and the resulting carbon emissions contribute to the climate crisis. We measured the carbon footprint of the functional magnetic resonance imaging (fMRI) preprocessing tool fMRIPrep, testing the effect of varying parameters on estimated carbon emissions and preprocessing performance. Performance was quantified using (a) statistical individual‐level task activation in regions of interest and (b) mean smoothness of preprocessed data. Eight variants of fMRIPrep were run with 257 participants who had completed an fMRI stop signal task (the same data also used in the original validation of fMRIPrep). Some variants led to substantial reductions in carbon emissions without sacrificing data quality: for instance, disabling FreeSurfer surface reconstruction reduced carbon emissions by 48%. We provide six recommendations for minimising emissions without compromising performance. By varying parameters and computational resources, neuroimagers can substantially reduce the carbon footprint of their preprocessing. This is one aspect of our research carbon footprint over which neuroimagers have control and agency to act upon.

parameters on estimated carbon emissions and preprocessing performance.Performance was quantified using (a) statistical individual-level task activation in regions of interest and (b) mean smoothness of preprocessed data.Eight variants of fMRIPrep were run with 257 participants who had completed an fMRI stop signal task (the same data also used in the original validation of fMRIPrep).Some variants led to substantial reductions in carbon emissions without sacrificing data quality: for instance, disabling FreeSurfer surface reconstruction reduced carbon emissions by 48%.We provide six recommendations for minimising emissions without compromising performance.By varying parameters and computational resources, neuroimagers can substantially reduce the carbon footprint of their preprocessing.This is one aspect of our research carbon footprint over which neuroimagers have control and agency to act upon.
• Switching off FreeSurfer surface construction can reduce carbon emissions by up to 48% with no loss in preprocessing performance.

| INTRODUCTION
We are in the midst of a climate crisis.According to the IPCC (2023), the release of greenhouse gases as a result of human activity has led to $1.09 C warming of global surface temperatures, when comparing the period of 2011-2020 to pre-industrial levels .More recent estimates from NASA suggest that average surface temperature in 2023 was-alarmingly-over 1.3 C warmer than in this same preindustrial period (https://climate.nasa.gov/vital-signs/globaltemperature/).Such warming contributes to extreme weather events, including fires, droughts, and floods, and a collapse in biodiversity (IPCC, 2022).As it is not yet possible to rely on carbon-negative technology (Carton et al., 2020) or offsetting (Watt, 2021), behaviour change is needed to combat this crisis.As scientists, we must accept the contribution of scientific practices (Valero, 2023).This can be seen in biomedical research, including in neuroscience, where carbon intensive practices include extraction of experimental resources, creation of single-use consumables, and long-haul flights to conferences (Aron et al., 2020;Epp et al., 2023;Farley, 2022;Keifer & Summers, 2021;Nathans & Sterling, 2016;Rae, 2022;Rae et al., 2022;Zak et al., 2020).One often neglected aspect of this research footprint is emissions associated with data processing and storage.
The information communication technology sector is responsible for an estimated 1.8%-3.9% of global CO 2 emissions (Freitag et al., 2021)-largely attributable to electricity required to run, maintain, and cool servers (Khalaj et al., 2016).Compute-heavy research fields include machine learning (Anthony et al., 2020), astrophysics (Portegies Zwart, 2020), and bioinformatics (Grealey et al., 2022).Tools have been developed to estimate the carbon emissions of such computingincluding the online calculator Green Algorithms (Lannelongue et al., 2021), server-side tool GA4HPC (https://www.green-algorithms. org/GA4HPC), and embedded packages Carbontracker (Anthony et al., 2020) and CodeCarbon (Goyal-Kamal et al., 2021).Such tools estimate energy used during computing, which is used in conjunction with the carbon intensity of the local energy supply to estimate emissions.This is a useful first step in addressing the footprint of one's computing (Lannelongue et al., 2023;Lannelongue & Inouye, 2023).
Neuroimaging research frequently relies on computationally expensive processing with large datasets (Rae et al., 2022).This includes statistical analysis of brain structure and function, and preprocessing needed to prepare data for analysis.For functional magnetic resonance imaging (fMRI), preprocessing steps include registration into a standard template space, extraction (identifying brain tissue), segmentation between tissue type boundaries, distortion correction, and identification of nuisance regressors (Caballero-Gaudes & Reynolds, 2017).
fMRIPrep is a preprocessing pipeline which combines best practice tools from multiple packages (Esteban et al., 2019(Esteban et al., , 2020)).fMRIPrep runs several options by default, including surface reconstruction in should decrease emissions while having no impact on preprocessing quality.Beyond this, analysis was exploratory.
The results reported in this article have been reproduced by an independent statistician from the University of Sussex (report: https://osf.io/hzu69).
Several deviations were made from this preregistration plan.Each is detailed in the Supplementary Section 'Deviations from preregistration'.

| Participants
We used existing data from the publicly available UCLA Consortium for Neuropsychiatric Phenomics LA5c Study (Bilder et al., 2020;Gorgolewski et al., 2017;

| Design
This study had a repeated measures design, with raw fMRI data for all 257 participants being run through eight fMRIPrep pipeline variants.

| Stop signal task
The procedure used for the stop signal task is described fully in Pol-

| Preprocessing in fMRIPrep
A baseline fMRIPrep pipeline was run on the stop signal task data for all 257 subjects.Flags passed to this pipeline are in Table 1.We ran seven variants on this pipeline for all subjects.Table 2 presents a numeric 'ID', descriptive 'label', and description for each variant.
Largely default settings were used for the baseline pipeline, with the exception of using 'MNI152NLin6Asym:res-2' as an output space, rather than the default of 'MNI152NLin2009cAsym' in native resolution.Doing so allowed for a fairer comparison between the baseline pipeline and the pipeline employing the independent component analysis strategy for the Automatic Removal of Motion Artifacts (ICA-AROMA; P5).Two further pipelines manipulating parallelisation of computing are described and analysed in the Supplementary Section 'Parallelisation analysis' (Tables S1-S3 and Figures S1 and   S2), including a discussion of why they were not included in the main analysis (Figure S3).
Further details on the steps implemented for each pipeline can be seen in the Supplementary Section 'fMRIPrep citation boilerplates'.
All variants of fMRIPrep were run on the University of Sussex high performance cluster, comprising 1250 central processing unit (CPU) cores across $90 nodes.fMRIPrep was run in a singularity container, using version 22.1.1.All jobs were run on nodes using an Intel ® Xeon ® Processor E5-2640 v3. 1 For preprocessing in fMRIPrep for each pipeline, we measured the size (GB) of files generated for each subject.

| Carbon tracking
We had planned to estimate carbon emissions of each pipeline using The only exception being for the supplementary P8 in which parallelisation was increased.
Jobs for the pipeline were necessarily run on nodes using more efficient processors, creating a hardware confound for this pipeline.See supplementary section 'Parallelisation analysis' for more detail.
Instead, we developed an in-house tracking system that relies on retrospective use of HPC scheduler logs.Energy usage (kWh) for CPU was estimated by multiplying total CPU time by the power usage of a CPU core. to that used by the Green Algorithms server-side tool GA4HPC (https://www.green-algorithms.org/GA4HPC).We extracted estimated emissions (kgCO 2 eq), duration of preprocessing (hours), and energy used by CPU and RAM (kWh).
It is important to note that there is currently no gold standard approach for estimating energy usage, and other approaches are available.For instance, one could attempt to define energy usage based on the amount of memory actually used in computing.However, total memory available/reserved for a given job may provide a more accurate measure (Karyakin & Salem, 2017), and this is the approach used in Green Algorithms (Lannelongue et al., 2021).We instead used the maximum memory employed in order to fairly account for how much memory a given task should have used, rather than merely the amount allocated (consistent across pipelines and subjects), given that on our architecture, memory used occasionally exceeded memory requested.
In the current experiment, RAM energy usage for a given job accounted for an average of 28% of total energy usage (CPU + RAM; for P0), broadly consistent with prior evidence (Fukazawa et al., 2018).

| Smoothing and mean smoothness estimation in AFNI
Following preprocessing in fMRIPrep, we followed the procedure for smoothing and estimating smoothness used by Esteban et al. (2019).
The preprocessed blood oxygen level dependent (BOLD) data for the stop signal task was smoothed using the AFNI tool 3dBlurInMask, with a full-width at half-maximum (FWHM) smoothing kernel of 5 mm.Smoothing was masked by the respective subject's brain mask provided by fMRIPrep.Mean geometric smoothness estimates were extracted for each participant's BOLD run both prior to and following smoothing, using AFNI's 3dFWHMx.Outlier voxels were accounted for using the 'detrend' flag.This tool provides a mean smoothness estimate for each of the 184 volumes of the BOLD run on the x, y, and z dimension.For each dimension, outlier smoothness estimates ± three standard deviations (SDs) from the dimension mean were removed.Values were then averaged within and across dimensions to T A B L E 1 Flags passed to the baseline fMRIPrep pipeline.

Flag Description
--ignore slicetiming • Tells fMRIPrep not to perform slicetiming correction-this step was not deemed necessary for this study.
--track-carbon • Runs CodeCarbon during preprocessing, providing a subjectspecific estimate of total carbon emissions, duration, and energy consumption.
--country-code GBR • Specifies the country in which CodeCarbon has been run using the relevant ISO Alpha-3 country code, in this case for Great Britain.
--random-seed 1234 • Initiates a specific seed for the workflow.Along with the two following flags below, this helps to ensure reproducible preprocessing.a --skull-strip-fixed-seed • Avoids using a random seed for skullstripping.Again, this ensures reproducible preprocessing when used with the flag above and below.
--omp-nthreads 1 • The maximum number of threads allocated per process.Setting this at a value of 1 ensures reproducible preprocessing when used with the two flags above.
--nthreads 5 • The number of threads used across all processes, set at a default value of 5.
--mem-mb 3000 • Sets an upper bound memory limit for fMRIPrep processes, at 3000 MB.
--output-spaces MNI152NLin6Asym: res-2 • Specifies the output space in which to register fMRI data.Outputs data to a template with standard 2 mm resolution.
--skip-bids-validation • This flag uses the assumption that input data is BIDS-valid (this was verified using the BIDS Validator prior to preprocessing; https://bids-standard.github.io/bids-validator/).When used, BIDS validation is skipped within fMRIPrep.
Note: Excludes arguments used to identify the location of relevant directories and licences.Values or strings given to arguments (when relevant) are in bold.
produce a single mean smoothness estimate, for both the pre-and post-smoothed data.As in Esteban et al. (2019), larger mean smoothness values were taken as reflecting poorer pipeline performance, on the basis that increased smoothness reflects loss of anatomical specificity and statistical power.Smoothing was conducted for all pipelines with the exception of Pipeline 5, given that smoothing was performed during preprocessing as part of the ICA-AROMA procedure, with a smoothing kernel of 6 mm.As such, we were unable to obtain estimates of pre-smoothed smoothness for this pipeline.

| Analysis in FSL
To statistically measure task-related activation during performance of the stop signal task, we employed general linear modelling using the FEAT function of FSL (Woolrich et al., 2001).In first-level analysis, preprocessing steps were skipped, with the exception of high-pass temporal filtering at 100 Hz.The respective preprocessed BOLD data for the stop signal run was used as the input file for each participant.
In line with the design used by Gorgolewski et al. (2017) and Esteban et al. (2019), six regressors were used as EVs: 1. Go-Trials in the 'go' condition for which participants responded correctly, with a fixed duration of 1.5 s.
2. Go RT-Trials in the 'go' condition for which participants responded correctly.Duration was set as the subject's response time (RT) for the respective trial, rather than being fixed.This EV was orthogonalised to 'go'.
3. Successful stop-Trials in the 'stop' condition for which participants successfully inhibited their motor response, with a fixed duration of 1.5 s.
4. Unsuccessful stop-Trials in the 'stop' condition for which participants did not successfully inhibit their motor response, with a fixed duration of 1.5 s.
5. Unsuccessful stop RT-Trials in the 'stop' condition for which participants did not successfully inhibit their motor response.Duration was set as the subject's RT for the respective trial, rather than being fixed.This EV was orthogonalised to 'stop_unsuccess'.
6. Erroneous-Trials in the 'go' condition for which participants responded incorrectly, with a fixed duration of 1.5 s.Ninety-nine participants had no erroneous 'go' trials.In these cases, task data were modelled with five EVs, with 'Erroneous' omitted.
The onset time of each regressor was set as the start of the 0.5 s fixation that preceded each decision period.• Adds a surface output space alongside the volumetric space.This is predicted to increase carbon emissions while having no effect on preprocessing performance (given that data resampled to this surface space was not used in subsequent analysis steps that analysed data in volumetric space only).

ICA-AROMA --use-aroma
• Adds an independent components analysis (ICA) approach to automatic removal of motion artifacts (AROMA) to the pipeline.Predicted to increase carbon emissions, while improving statistical sensitivity to task activation.
6 Fieldmap-free distortion correction --use-syn-sdc error • fMRIPrep's in-house fieldmap-free distortion correction, using the subject's T1w image as an anatomically unwarped reference.Predicted to increase emissions.This should theoretically improve preprocessing performance.

7
Increase output space resolution --output-spaces MNI152NLin6Asym:res-1 • Increases the resolution of the volumetric output space used to resample anatomical and functional images, from 2 mm to 1 mm.Predicted to increase both carbon emissions and preprocessing performance.
Note: 'flag addition' refers to the respective arguments passed to the fMRIPrep command line (https://fmriprep.org/en/stable/usage.html).Values or strings given to arguments (when relevant) are in bold.Following first-level analysis, activation in two regions of interest (ROIs) was interrogated using FSL Featquery.ROIs were selected on the basis of peak activations in group-level activation maps from analysis of this dataset by Gorgolewski et al. (2017), available on Neurovault (http://neurovault.org/collections/2606).An ROI in the left primary motor cortex (postcentral gyrus) was selected for the 'motor' contrast (go > successful stop).An ROI in the presupplementary motor area (pre-SMA) was selected for the 'response inhibition' contrast (successful stop > go).This region was selected due to its relevance to response inhibition (Rae et al., 2015;Sebastian et al., 2013;Sharp et al., 2010).In both cases, we identified the peak voxel in the respective region in the Gorgolewski et al. (2017) group map and generated a 5 mm sphere around its voxel coordinates.
These ROIs are visualised in Figure 1.We extracted the mean zstatistic value of voxels within each sphere for the relevant contrast ('go' for primary motor cortex, 'successful stop' for pre-SMA).Higher mean z-statistics in these regions, as a proxy for sensitivity to statistical activation (Esteban et al., 2019), were taken to reflect better performance.
We also performed supplementary analysis on two other ROIs relevant to the response inhibition.This included right insula (inferior opercular regions such as the insula have also been evidenced to support response inhibition; Cunillera et al., 2016;Zhang et al., 2017) and the left primary auditory cortex (due to the presentation of an auditory tone in 'stop' trials).Visualisation and analysis of these regions can be seen in Supplementary Section 'Supplementary task activation' (Figures S5 and S6 and Table S4).
Finally, higher-level group analysis was conducted in FSL FEAT (Woolrich et al., 2004), testing for significant activation associated with 'go > successful stop' and 'successful stop > go' across the 257 subjects for each pipeline.A cluster corrected threshold of Z = 3.1, p = .05was used.
A summary of the experimental procedure is provided in Figure 2, alongside the output measures extracted at each stage.

| Data analysis
To summarise, dependent variables include the carbon tracking metrics (1) carbon emissions (kgCO 2 eq), ( 2) duration (hours), (3) CPU energy (kWh), and (4) RAM energy (kWh), mean smoothness estimates for (5) pre-smoothed and ( 6) post-smoothed data, task activation in (7) the left primary motor cortex for motor responses and (8) the pre-SMA for successful response inhibition, and (9) total file size.Prior to analysis, outliers (3 SD above or below the group mean) for the respective variable for the respective pipeline were removed.
For each dependent variable, we then ran a repeated measures ANOVA observing the main effect of pipeline identity, with eight levels corresponding to the pipelines detailed in Table 2 (the ANOVA for pre-smoothed data had seven levels given that this metric was not available for P5).When significant effects of pipeline were observed,  (1961).All ANOVAs and contrasts were run in the statistical package JASP version 0.17.2.1 (JASP Team, 2023).
The computational reproducibility of the results reported in this study has been checked by an independent statistician (see https:// osf.io/hzu69 for the full report).This was conducted as part of the University of Sussex School of Psychology initiative 'Ensuring the computational reproducibility of to-be-submitted psychology papers' (https://doi.org/10.17605/osf.io/dr35v).
Analysis was followed by creation of activation count maps and SD maps for each pipeline, as in Esteban et al. (2019).This allowed for qualitative assessment of the anatomical specificity and variability of each pipeline.The procedure used to generate each of these maps is described in Section 3.2.

| Pipeline comparisons
The mean carbon emissions of each pipeline are plotted against preprocessing performance (task activation and smoothness), duration, and energy usage in Figure 3.The mean file size for each pipeline is in Figure 4. Table 3 provides results of repeated measures ANOVAs used to observe the effect of pipeline identity on each dependent variable.
For each pipeline, effects for each metric relative to baseline (determined by contrasts in Table S5) are parsed below.Given that carbon emissions are exactly proportional to overall energy usage,4 CPU and RAM energy usage are only discussed when their patterns of significance deviate from those of emissions.Here, reductions in energy usage are typically attributable to reduced compute time.As such, percent changes in emissions and duration are typically comparable.
F I G U R E 2 Experimental procedure of preprocessing and analysis steps, with outcome measures derived from each stage.CPU, central processing unit; GLM, generalised linear model; RAM, randomaccess memory.

| Pipeline 1-No FreeSurfer surface reconstruction
Disabling FreeSurfer surface reconstruction reduced emissions and duration by 48% and 49%, respectively.This had no significant effect on statistical sensitivity, and even yielded decreases of 2% and 0.4% in pre-smoothed and post-smoothed smoothness, respectively.5File size decreased by 10%.

| Pipeline 2-'Sloppy' testing mode
Enabling 'sloppy' mode (simplified registration) reduced emissions and duration by 44% and 41%, respectively.Reported activation in the pre-SMA decreased by 10%, and pre-smoothed smoothness increased by 5%.These performance losses were anticipated, given this mode is intended only for testing.File size decreased by 17%.

| Pipeline 3-Low memory mode
Low memory mode, which reduces memory usage at the cost of increasing disk usage in the working directory, reduced both emissions and duration by 6%.This had no impact on preprocessing performance (preprocessed data was identical to baseline).File size increased by 4%.Note that this memory optimisation appears to occur only during the compression of BOLD data.As such, savings in compute are likely to depend on the specific data being used.

| Pipeline 4-Adding a surface output space
Surprisingly, adding a surface output space reduced emissions and duration, by 4% and 5%, respectively (driven by a 4% decrease in CPU energy, but no significant change in RAM energy).Inclusion of a surface space may reduce redundancies in such a way that increases efficiency.
Given that analysis was not performed on the generated surface space, there were no changes in preprocessing performance.File size increased by 17%.While this setting did reduce emissions, we do not recommend it as a route to doing so, given that other settings reduce emissions to a greater degree, and the cause of these reductions is clearer.

| Pipeline 5-Implementing ICA-AROMA
Adding ICA-AROMA had no significant impact on duration, but increased emissions by 3%.Reported activation in the pre-SMA increased by 6%, with no significant change in the primary motor cortex.The mechanisms by which ICA-AROMA may improve performance is beyond the scope of the current analysis, but is discussed in prior papers (e.g., Pruim et al., 2015).Post-smoothing smoothness increased by 13%-to be expected given that AROMA entails a 6 mm smoothing kernel, higher than the 5 mm kernel applied to other pipelines.File size increased by 32%.Note that as of version 23.1.Abbreviations: CPU, central processing unit; pre-SMA, pre-supplementary motor area; RAM, random-access memory; SSP, sum-of-squares-and-products.
increase in emissions was driven by a 224% increase in RAM energy, consistent with a substantial increase in file size of 191%.This was the only pipeline for which RAM energy exceeded CPU energy (see Figure 3d).This manipulation had no significant effect on statistical sensitivity.7However, this pipeline decreased both pre-and postsmoothed smoothness, by 3% and 2%, respectively.

| Visualising anatomical variability and specificity
As in Esteban et al. (2019), we visually represented the anatomical specificity and variability of each pipeline.Activation count maps are presented below, to provide an illustration of qualitative consistency across pipelines.For the sake of brevity, other figures referenced below are in Supplementary Section 'Visualising variability and specificity'.All relevant maps are on Neurovault (https://neurovault. org/collections/JNAGJRUI). In each case, maps for P3 (low memory) and P4 (added surface output space) are identical to P0 (baseline), as these manipulations did not affect fMRIPrep output.
Activation count maps (Figure 5)8 reflect the percentage of subjects showing significant individual-level activation in each voxel for both the 'go > successful stop' and 'successful stop > go' contrasts, thresholded at 5%.Similar qualitative patterns emerged across pipelines.Quantitatively (Table S6), P1 (no FreeSurfer surface reconstruction) and P6 (fieldmap-free distortion correction) reduced the size of thresholded maps relative to P0. P5 (ICA-AROMA) and P7 (increased output space resolution) led to reliable increases, while P2 (sloppy preprocessing) was inconsistent.Similar patterns were observed for thresholded group-level activation maps (Figure S8).volumes.Variation was consistently highest around the brain outline and in anterior temporal/orbitofrontal regions-areas susceptible to distortion (Olman et al., 2009;Stenger, 2006).Particularly high variation around the outline was observed for P2, likely due to poor registration, and for P6, likely due to increased distortion.P6 also had 3% lower average SD than P0 (Table S7), perhaps reflecting poor sensitivity to statistical effects.
Activation SD maps (Figure S10) reflect the sample's mean SD of z-statistics in the unthresholded contrast of 'go > successful stop' for each voxel.Qualitatively, variation was consistently highest in areas implicated in the go or successful stop trials, including left primary motor cortex, bilateral auditory cortex, and lateral temporoparietal regions.P5 had 3% higher SD than P0, while P6 was 5% lower (Table S7), likely reflecting increased and decreased statistical sensitivity, respectively.

| DISCUSSION
Neuroimaging research frequently relies on computationally expensive data preprocessing.Here, we investigated the carbon footprint of preprocessing pipeline fMRIPrep and demonstrated that decreases in emissions do not necessitate reductions in preprocessing performance, as reflected in sensitivity to task activation and data smoothness.Box 1 provides recommendations based on these findings.
Although these recommendations are particularly relevant to neuroimaging, this paper can serve as a model to identify evidenced-based recommendations for mitigating the carbon footprint of digital tools.
Systematically studying the effect of parameters on emissions allows identification of acceptable levels of trade-offs in performance.This has been applied to deep learning, where reductions in dataset size and energy consumption can be made with minimal performance losses (e.g., Bakhtiarifard et al., 2024;Li & Chao, 2021).We encourage researchers both inside and outside neuroimaging to integrate carbon tracking functions (e.g., CodeCarbon [Goyal-Kamal et al., 2021], Carbontracker [Anthony et al., 2020]) into software, or use online calculators such as Green Algorithms (Lannelongue et al., 2021) or server-side tools such as GA4HPC (https://www.green-algorithms.org/GA4HPC).
This will enable researchers to measure carbon emissions and identify routes to reducing them.Tool developers could help here by integrating embedded carbon trackers and turning them on by defaultfor instance, automatic application of CodeCarbon in fMRIPrep.However, as demonstrated in the Supplementary Section 'Complications using CodeCarbon', this tool encounters issues such as with hardware isolation in the context of HPC Testing embedded carbon trackers on a wider variety of hardware and settings will help validation, and end users will benefit from advice on their appropriate application.
There are factors besides data processing parameters that impact the carbon footprint of computing.This includes the time of day that analysis is run-carbon intensity of local energy sources is highest at times of peak usage, at the start and end of the working day (Souter et al., 2023).By scheduling jobs to run at periods of low intensity (see  the data preprocessed here.Though not tested here, the use of scan-specific fieldmaps, when available, is likely to be preferable.
6. Refrain from using high-resolution volumetric output space templates (--output-spaces MNI152NLin6Asym: res-< number>) unless necessary.This drastically increases emissions and storage usage, while not benefitting task sensitivity and only somewhat benefitting smoothness.
Although the energy needed for research computing likely makes up a small fraction of global CO 2 emissions, this is an aspect of their work of which many researchers will have maximal control and knowledge needed to reduce their individual footprint.As the energy required for increasingly complex computing may continue to grow in the coming decades, it is important for individual researchers and tool developers to practice being mindful of their use of computing resources.This effort will be aided by transparency in reporting, using 'Environmental impact statements' (Souter et al., 2023) Best practice for minimising the footprint of research computing may depend on the dataset in question.In cases where a given dataset will be analysed once and never shared or reused, reducing compute by disabling unnecessary steps (e.g., switching off FreeSurfer reconstruction when not required 10 ) may be the best route.If shared and used by others, however, there may be environmental benefits of providing more derivatives rather than fewer.For example, if runtime was halved by removing FreeSurfer, and two secondary researchers then independently run FreeSurfer on this same dataset, there would be a likely net increase in energy consumption.The public sharing and re-use of high-quality preprocessed datasets with comprehensive derivatives (including surface reconstruction files, for example) could be beneficial in increasing the likelihood that individual researchers will not need to re-preprocess data.However, any potential savings should be weighed against the carbon footprint of storing public data on cloud computing services (Souter et al., 2023).In this sense, sharing a relatively small number of high-quality datasets is likely to be more sustainable than many labs sharing a large number of lowerquality ones (e.g., smaller sample size).Striking this balance in the most effective way will require a continued research culture shift to sharing of data that is findable, accessible, interoperable, and reusable (Wilkinson et al., 2016).The carbon footprint of sharing and downloading different data types through cloud computing services is relevant here, and quantification of this footprint would a valuable contribution.However, this is currently challenging given there may not be full transparency from commercial cloud computing providers on carbon emissions arising from their services.
There are limitations to the current study.First, to quantify preprocessing performance, we used statistical task sensitivity and mean data smoothness, similar to metrics used by Esteban et al. (2019) when comparing fMRIPrep to FSL FEAT.Task activation is limited in so far as we cannot assume 'ground truth' activation against which to benchmark performance.We relied on the assumption that a sufficiently robust paradigm should yield predictable hotspots.Other methods for quantifying preprocessing performance exist, such as split-half resampling (Churchill et al., 2015).Second, while carbon tracking tools have been validated as providing reasonable estimates (Jay et al., 2023), they are not perfect.Measures of energy usage rely on the proxy of TDP, which can underestimate usage (Lannelongue et al., 2023).Relatedly, our energy usage calculations do not account for idle power consumption of the HPC system that would be indirectly attributable to submission of our jobs.This decision was made in order to avoid emissions estimates being biased by the number of other jobs, including those from other users, running on the system.(22.1.1.).Future changes to the software (including those which may be specifically aimed at increasing environmental sustainability, such as efficiency of surface reconstruction) may affect both the quality of the output and the required computational resources.
The current results should be interpreted within this context.Finally, this study focused on individual efforts to make research computing more environmentally sustainable.While these efforts are vital, institutional system-level changes in policy and hardware efficiency will also be needed to facilitate change.
In conclusion, we have provided evidence-based recommendations for how neuroimaging researchers can reduce the carbon footprint of fMRIPrep, without sacrificing data quality.We encourage similar investigations of other digital data processing and analysis tools.

| ENVIRONMENTAL IMPACT STATEMENT
Across all subjects and pipelines, preprocessing for the current experiment in fMRIPrep produced an estimated 72.9 kg of carbon dioxide equivalent emissions (kgCO 2 eq), as determined using an in-house server-side carbon tracker.

FreeSurfer
and registration and extraction in Advanced Normalization Tools (ANTs).Its command line interface provides the option to add, remove, or adjust preprocessing steps-which in turn impact runtime, files generated, and computational resources needed.As of version 22.1.0(12 December 2022), CodeCarbon has been integrated into fMRIPrep (https://fmriprep.org/en/stable/changes.html#december-12-2022), allowing users to estimate carbon emissions for a subject's preprocessing.fMRIPrep is therefore well placed to test the tractability of reducing the carbon footprint of a neuroimaging pipeline.Here, we measured the impact of preprocessing parameters in fMRIPrep on carbon emissions across eight pipeline variants.We also tracked preprocessing performance using two metrics used in the original validation of fMRIPrep(Esteban et al., 2019)-statistical task activation and mean data smoothness.In estimating both carbon emissions and preprocessing performance, we aimed to identify settings that minimise carbon emissions while providing little or no trade-off in the quality of derived image phenotypes.While insufficient data exists to formulate specific hypotheses, adding preprocessing steps (e.g., distortion correction) should increase carbon emissions while benefitting pipeline performance.Conversely, making pipelines more computationally efficient (e.g., reducing memory usage) https://openneuro.org/datasets/ds000030/ versions/1.0.0).A 'data descriptor' of this project is provided by Poldrack et al. (2016).This dataset includes 272 participants including those with a diagnosis of ADHD (N = 43), bipolar disorder (N = 29), or schizophrenia (N = 50) as well as neurologically healthy controls (N = 130).Participants completed a series of MRI scans including structural scans, diffusion-weighted imaging, resting state fMRI, and task-based fMRI for six tasks.We analysed functional data for the stop signal task only.Data for this same sample and task were used in the original validation of fMRIPrep (Esteban et al., 2019) when comparing fMRIPrep and FSL FEAT.Participants were necessarily excluded if they did not have data for a T1 structural scan or the single run of the stop signal task (N = 13).Consistent with Esteban et al. (2019), we also excluded participants who had no valid trials classified as successful go (N = 1) or successful stop (N = 1).The final sample included 257 participants (108 female) with a mean age of 33.3 (SD = 9.3).
drack et al. (2016).Participants completed one run of this task during fMRI.Each participant completed 128 trials; 96 'go' and 32 'stop'.Each trial was preceded by a 500 ms fixation cross.Participants were then presented with an arrow pointing left or right, indicating the response they should provide, which remained visible for 1000 ms (the decision period).Using a button box in their right hand, participants were instructed to indicate the direction of the stimulus arrow on 'go' trials.Participants were instructed to inhibit these button presses during 'stop' trials, on which they heard a 500 Hz audio tone.The delay period between the arrow appearing and the auditory 'stop' stimulus for a given trial increased by 50 ms intervals after previous successful inhibition of a motor response and decreased by 50 ms intervals following failed inhibitions.Audio stop signals lasted for 250 ms.Trials were separated by jittered null periods during which a blank screen was presented.Null periods ranged from 500 to 4000 ms, with an average of 1000 ms.

3
It is worth noting that the actual EVs provided in the public dataset linked toGorgolewski et al. (2017) do not conform to the design described in the paper.Each fixed duration is 1 second from the onset, accounting for the fixation cross and only half (500 ms) of the decision period.We have opted not to replicate this approach because (a) it would reduce statistical power by removing part of the trial from the model and (b) changing to the intended duration of 1.5 seconds should serve only to change the magnitude of the effect relative to the original dataset, rather than its qualitative nature (i.e., location of peak activation).coveredby the EVs include null periods between trials, and 'go' trials on which participants provided no response (only 79 participants had any instance of this).As inEsteban et al. (2019), we extracted the contrast of 'go > successful stop', producing a statistical image in which positive values reflected voxels activated by motor response, and negative values reflected voxels activated by successful response inhibition.The reverse of this contrast was also extracted to provide 'successful stop > go'.First-level results were cluster-corrected at a threshold of Z = 3.1 (p = .05).

F
ANOVAs were followed by t-test contrasts comparing each experimental pipeline (P1-P7) to the baseline pipeline (P0), with false (a) -Motor ROI (left primary motor cortex) (b) -Response inhibition ROI (pre-SMA) I G U R E 1 ROIs for use in FSL Featquery, including the left primary motor cortex for motor response activation, and the pre-SMA for successful response inhibition activation.pre-SMA, pre-supplementary motor area; ROI, region of interest.discovery rate correction using the Benjamini-Hochberg method applied.Note that for ANOVAs observing performance metrics (smoothness and task activation), it was not possible to perform sphericity tests or apply sphericity correction given that multiple pipelines (P0, P3, P4) presented with identical values, providing singular sum-ofsquares-and-products (SSP) matrices.While this may lessen confidence in these ANOVAs, post hoc contrasts should still provide an accurate picture of pipeline differences.Each frequentist ANOVA was followed by an equivalent repeated measures Bayesian ANOVA.For each dependent variable, this allowed us to quantify how much more likely the effect of pipeline is to be supported than the null hypothesis (no effect of pipeline).Each ANOVA was constructed using the default coefficient fixed effect prior value (0.5) with a Bayes factor of BF 10 .Each model was compared with the null model.Inference for the interpretation of observed factors was based on the classifications provided by Andraszewicz et al. (2015), originally from Jeffreys

F
I G U R E 3 Estimated carbon emissions (dotted green line) plotted against (a) statistical task activation in regions of interest, (b) estimated data smoothness, (c) duration of preprocessing, and (d) CPU and RAM energy usage, for each pipeline.Error bars reflect one standard error of the mean.These are frequently too small to be visible.Note that the scale used varies between variables, see text below for percent changes.N = 257.CO 2 eq, carbon dioxide equivalent; CPU, central processing unit; kg, kilograms; kWh, kilowatt hours; mm, millimetres; pre-SMA, presupplementary motor area; RAM, random-access memory.F I G U R E 4The mean total size (GB) of all files generated for a given subject, split by 'derivatives' (final output files), 'scratch' (working directory), and the subject-specific 'FreeSurfer' directory, for each pipeline.Error bars are one standard error of the mean, these are too small to be visible.The mean change in size relative to the baseline pipeline (P0) is highlighted for each experimental pipeline.N = 257.GB, gigabytes.
Timeseries SD maps (Figure S9) reflect the sample's mean SD of preprocessed timeseries values for each voxel across all 184 fMRI Fieldmap-free distortion correction Pipeline 7 -Increased output space spatial resolution F I G U R E 5 Activation count maps for each fMRIPrep pipeline.Values reflect the percentage of participants within the sample (N = 257) showing significant individual-level activation in a given voxel for both the 'go > successful stop' (motor; hot colours) and 'successful stop > go' (response inhibition; cool colours) contrasts.Slices presented are at MNI coordinates Z = 4 and Z = 52.AROMA, automatic removal of motion artifacts; ICA, independent components analysis.

5.
At fMRIPrep version 22.1.1,the fieldmap-free distortion correction technique (--use-syn-sdc) can both increase emissions and degrade data quality, based on Description of fMRIPrep pipeline variants.The baseline fMRIPrep command with Table1flags only.All predicted directions of effects below are relative to this baseline.Removes the default FreeSurfer brain surface reconstruction.This can be expected to decrease carbon emissions while not affecting preprocessing performance, as surface reconstruction should not impact volumetric preprocessing.
3Time periods notT A B L E 2 Main effects of pipeline for each dependent variable, for both frequentist and Bayesian ANOVAs.Note: * reflects significant results at p < .05.Due to violation of the assumption of sphericity, Greenhouse-Geisser correction applied to ANOVAs for carbon emissions, duration, CPU energy, and RAM energy, and total file size.It was not possible to perform sphericity tests for smoothness or task activation ANOVAs given singular SSP matrices (see Section 2.5).N = 257.
However, in real terms, this means the energy usage attributable to fMRIPrep is likely somewhat underestimated.Third, carbon emission values reported are specific to the hardware employed in our institutional HPC cluster.Certain effects may be more or less evident on different systems.Fourth, we did not use live carbon intensity data in estimating emissions (see https://carbonintensity.org.uk).Comparison of data across pipelines would otherwise have been confounded, given that carbon intensity varies systematically by time of day and counterbalancing job start time would have been impractical.Fifth, this study does not consider environmental effects beyond electricity, such as water use and hardware manufacturing.Sixth, the analysis conducted here is specific to the output generated by a single version of fMRIPrep Computing was conducted in the southeast of England, with estimated annual average carbon intensity of 193.38 g of CO 2 per kilowatt hour (http://www.carbonfootprint.com).